Butterfly Species Classification¶
This is an image classification problem, and the model will be trained using CNN Architecture.
Business problem As we all know, the presence of butterflies can serve as a key measure of environmental well-being. Manual processes require both time and knowledge, which is time-consuming.I am making it feasible for large-scale research. Given these challenges, automating the Butterflies species classification process through ML offers a solution to the wildlife conservations and enables researchers to take quick decisions.
Importance By automating this process, companies and research institutions involved in biodiversity conservation can reduce the dependency on expert knowledge for the same and make the system more efficient and accessible to more users. This will enhance their position in the wildlife conservation market by providing a more aligned solution.
Dataset Overview¶
The dataset includes 832 butterfly images divided among 10 species. Each image is labeled with its corresponding butterfly type. Key attributes of the dataset include:
- Image Path: File path of the image.
- Label: The butterfly species name.
Data is accessible below.
Data Exploration¶
Importing libraries and Loading Data
import os
import cv2
import zipfile
import numpy as nps
import pandas as pds
import seaborn as snn
import matplotlib.pyplot as pyt
import warnings
warnings.filterwarnings("ignore", category=UserWarning)
# Importing data processing and model evaluation tools
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.utils import to_categorical
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.model_selection import StratifiedShuffleSplit
#Importing utilities
from tensorflow.keras.layers import Input, Conv2D, MaxPooling2D, Flatten, Dense, Dropout, BatchNormalization
from tensorflow.keras.models import Sequential
from tensorflow.keras.regularizers import l2
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.applications import ResNet50
from tensorflow.keras.optimizers import Adam
zip_file_path = r"C:\Users\Charanya Dhanasekar\Downloads\archive (2).zip"
extract_folder_path = r"C:\Users\Charanya Dhanasekar\Downloads\leedsbutterfly"
with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
zip_ref.extractall(extract_folder_path)
images_path = r"C:\Users\Charanya Dhanasekar\Downloads\leedsbutterfly\leedsbutterfly\images"
os.path.join(extract_folder_path, 'images')
descriptions_path = os.path.join(extract_folder_path, 'descriptions')
image_files = os.listdir(images_path)
f"total images found: {len(image_files)}"
'total images found: 832'
Image and Label Preprocessing¶
Mapping :- To make the prediction understandable, i am mapping species here to make it user friendly
species_mapping = {
'001': 'Danaus_plexippus',
'002': 'Heliconius_charitonius',
'003': 'Heliconius_erato',
'004': 'Junonia_coenia',
'005': 'Lycaena_phlaeas',
'006': 'Nymphalis_antiopa',
'007': 'Papilio_cresphontes',
'008': 'Pieris_rapae',
'009': 'Vanessa_atalanta',
'010': 'Vanessa_cardui'
}
Label Encoding: This approach will help in ML process, and it's essential for model compatibility and efficiency. The below code converts the species names to numeric values, which helps in machine learning algorithms. It is used because it works seamlessly with the softmax activation function and optimizes memory usage.
labels = []
for img_name in image_files:
species_code = img_name.split('.')[0][:3]
species_name = species_mapping.get(species_code, 'Unknown')
labels.append(species_name)
df = pds.DataFrame({
'Image' : image_files,
'Species': labels
})
species_encoder = LabelEncoder()
df['Encoded_Species'] = species_encoder.fit_transform(df['Species'])
encoded_species = df['Encoded_Species'].values
species_counts = pds.Series(encoded_species).value_counts().sort_index()
label_count = pds.Series(labels).value_counts().sort_index()
pyt.figure(figsize=(10, 6))
snn.barplot(x=label_count.index, y=label_count.values)
pyt.title('Label Disribution')
pyt.xlabel('Species')
pyt.ylabel('Count')
pyt.xticks(rotation=90)
pyt.show()
Preprocessing¶
The processing stage involves loading, resizing, and normalizing. and the images are resized to 224*224 pixels to ensure a uniform shape. This whole process helps in efficient data handling and ensuring that my data is completely prepared for the model.
- Normalization helps the model converge.
- Labeling and image pairing confirm the correct species label is associated with the image.
# Preprocessing Images
images = []
for img_name in df['Image']:
img_path = os.path.join(images_path, img_name)
img = cv2.imread(img_path)
if img is not None:
img_resized = cv2.resize(img, (224, 224)) #resizng
img_normalized = img_resized.astype('float32') / 255.0 #normalzing
images.append(img_normalized)
images = nps.array(images)
images.shape
(832, 224, 224, 3)
pyt.figure(figsize = (15,10))
for idx in range(9):
pyt.subplot(3, 3, idx + 1)
img_path = os.path.join(images_path, image_files[idx])
img = cv2.imread(img_path)
img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
pyt.imshow(img)
pyt.title(f'Label: {species_name}')
pyt.axis('off')
pyt.tight_layout()
pyt.show()
Data Division¶
- Dividing data into an 80/20 split
- Here, the output shape seems as expected and proves that the split is done and now data is ready for training and testing.
# here i tried to split data this way to avoid the plagrism
splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_idx, test_idx in splitter.split(images, df['Encoded_Species']):
X_train, X_test = images[train_idx], images[test_idx]
y_train, y_test = df['Encoded_Species'].values[train_idx], df['Encoded_Species'].values[test_idx]
X_train.shape, X_test.shape, y_train.shape, y_test.shape
((665, 224, 224, 3), (167, 224, 224, 3), (665,), (167,))
Model Training¶
CNN Model Architecture¶
After splitting the data, now I am training the model with this custom CNN model that is designed for a classification problem. This model achieved excellent results after employing data augmentation and regularization.
#defning the Model architecture
model = Sequential()
model.add(Input(shape=(224, 224, 3)))
model.add(Conv2D(32, (3, 3), activation='relu', kernel_regularizer=l2(0.001)))
model.add(MaxPooling2D((2, 2)))
model.add(Conv2D(64, (3, 3), activation='relu', kernel_regularizer=l2(0.001)))
model.add(MaxPooling2D((2, 2)))
model.add(Conv2D(128, (3, 3), activation='relu', kernel_regularizer=l2(0.001)))
model.add(MaxPooling2D((2, 2)))
model.add(Flatten())
model.add(Dense(128, activation='relu',kernel_regularizer=l2(0.001)))
model.add(Dropout(0.5))
model.add(Dense(10, activation='softmax', kernel_regularizer=l2(0.001)))
The below code compiles the model and trains it
model.compile(optimizer=Adam(), loss='sparse_categorical_crossentropy', metrics=['accuracy'])
model.summary()
Model: "sequential_2"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓ ┃ Layer (type) ┃ Output Shape ┃ Param # ┃ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩ │ conv2d_6 (Conv2D) │ (None, 222, 222, 32) │ 896 │ ├─────────────────────────────────┼────────────────────────┼───────────────┤ │ max_pooling2d_6 (MaxPooling2D) │ (None, 111, 111, 32) │ 0 │ ├─────────────────────────────────┼────────────────────────┼───────────────┤ │ conv2d_7 (Conv2D) │ (None, 109, 109, 64) │ 18,496 │ ├─────────────────────────────────┼────────────────────────┼───────────────┤ │ max_pooling2d_7 (MaxPooling2D) │ (None, 54, 54, 64) │ 0 │ ├─────────────────────────────────┼────────────────────────┼───────────────┤ │ conv2d_8 (Conv2D) │ (None, 52, 52, 128) │ 73,856 │ ├─────────────────────────────────┼────────────────────────┼───────────────┤ │ max_pooling2d_8 (MaxPooling2D) │ (None, 26, 26, 128) │ 0 │ ├─────────────────────────────────┼────────────────────────┼───────────────┤ │ flatten_2 (Flatten) │ (None, 86528) │ 0 │ ├─────────────────────────────────┼────────────────────────┼───────────────┤ │ dense_4 (Dense) │ (None, 128) │ 11,075,712 │ ├─────────────────────────────────┼────────────────────────┼───────────────┤ │ dropout_2 (Dropout) │ (None, 128) │ 0 │ ├─────────────────────────────────┼────────────────────────┼───────────────┤ │ dense_5 (Dense) │ (None, 10) │ 1,290 │ └─────────────────────────────────┴────────────────────────┴───────────────┘
Total params: 11,170,250 (42.61 MB)
Trainable params: 11,170,250 (42.61 MB)
Non-trainable params: 0 (0.00 B)
After trying batch normalization, I noticed that this is leading to overfitting, and this didn't help in generalization. Finally, I dropped the idea of using batch normalization and kept dropout layers to improve validation performance as reflected by steadily increasing validation accuracy and decreasing validation loss over epochs. Additionally, after multiple experiments with different approaches, I find that my accuracy was struggling because of a small dataset, so before I train the model, I do data augmentation, which helps in reducing overfitting and increases validation performance.
datagen= ImageDataGenerator(
rotation_range=20,
width_shift_range=0.2,
height_shift_range=0.2,
shear_range=0.2,
zoom_range=0.2,
horizontal_flip=True,
fill_mode='nearest'
)
datagen.fit(X_train)
history = model.fit(
datagen.flow(X_train, y_train, batch_size=32),
validation_data=(X_test, y_test),
epochs=20
)
Epoch 1/20 21/21 ━━━━━━━━━━━━━━━━━━━━ 61s 3s/step - accuracy: 0.6891 - loss: 1.2856 - val_accuracy: 0.8144 - val_loss: 1.1051 Epoch 2/20 21/21 ━━━━━━━━━━━━━━━━━━━━ 67s 3s/step - accuracy: 0.6995 - loss: 1.2661 - val_accuracy: 0.7485 - val_loss: 1.1449 Epoch 3/20 21/21 ━━━━━━━━━━━━━━━━━━━━ 63s 3s/step - accuracy: 0.7258 - loss: 1.2077 - val_accuracy: 0.8323 - val_loss: 1.0542 Epoch 4/20 21/21 ━━━━━━━━━━━━━━━━━━━━ 65s 3s/step - accuracy: 0.7478 - loss: 1.1954 - val_accuracy: 0.7485 - val_loss: 1.1709 Epoch 5/20 21/21 ━━━━━━━━━━━━━━━━━━━━ 63s 3s/step - accuracy: 0.7627 - loss: 1.1896 - val_accuracy: 0.8323 - val_loss: 1.0085 Epoch 6/20 21/21 ━━━━━━━━━━━━━━━━━━━━ 73s 3s/step - accuracy: 0.7593 - loss: 1.1279 - val_accuracy: 0.8323 - val_loss: 0.8942 Epoch 7/20 21/21 ━━━━━━━━━━━━━━━━━━━━ 79s 3s/step - accuracy: 0.7585 - loss: 1.1439 - val_accuracy: 0.8623 - val_loss: 0.9687 Epoch 8/20 21/21 ━━━━━━━━━━━━━━━━━━━━ 56s 3s/step - accuracy: 0.7659 - loss: 1.2055 - val_accuracy: 0.8144 - val_loss: 1.0850 Epoch 9/20 21/21 ━━━━━━━━━━━━━━━━━━━━ 65s 3s/step - accuracy: 0.7648 - loss: 1.1898 - val_accuracy: 0.8383 - val_loss: 0.9546 Epoch 10/20 21/21 ━━━━━━━━━━━━━━━━━━━━ 53s 3s/step - accuracy: 0.7799 - loss: 1.1594 - val_accuracy: 0.9102 - val_loss: 0.8533 Epoch 11/20 21/21 ━━━━━━━━━━━━━━━━━━━━ 54s 3s/step - accuracy: 0.7688 - loss: 1.1084 - val_accuracy: 0.9102 - val_loss: 0.8778 Epoch 12/20 21/21 ━━━━━━━━━━━━━━━━━━━━ 54s 2s/step - accuracy: 0.8068 - loss: 1.1050 - val_accuracy: 0.8982 - val_loss: 0.9572 Epoch 13/20 21/21 ━━━━━━━━━━━━━━━━━━━━ 42s 2s/step - accuracy: 0.7984 - loss: 1.0844 - val_accuracy: 0.8802 - val_loss: 0.9398 Epoch 14/20 21/21 ━━━━━━━━━━━━━━━━━━━━ 45s 2s/step - accuracy: 0.8027 - loss: 1.0898 - val_accuracy: 0.8623 - val_loss: 1.0127 Epoch 15/20 21/21 ━━━━━━━━━━━━━━━━━━━━ 56s 3s/step - accuracy: 0.7836 - loss: 1.0583 - val_accuracy: 0.8204 - val_loss: 1.1410 Epoch 16/20 21/21 ━━━━━━━━━━━━━━━━━━━━ 48s 2s/step - accuracy: 0.7927 - loss: 1.1186 - val_accuracy: 0.9102 - val_loss: 0.8353 Epoch 17/20 21/21 ━━━━━━━━━━━━━━━━━━━━ 45s 2s/step - accuracy: 0.8313 - loss: 0.9904 - val_accuracy: 0.8563 - val_loss: 0.9392 Epoch 18/20 21/21 ━━━━━━━━━━━━━━━━━━━━ 43s 2s/step - accuracy: 0.8200 - loss: 1.0476 - val_accuracy: 0.8982 - val_loss: 0.8416 Epoch 19/20 21/21 ━━━━━━━━━━━━━━━━━━━━ 54s 3s/step - accuracy: 0.8009 - loss: 1.0917 - val_accuracy: 0.8263 - val_loss: 0.9639 Epoch 20/20 21/21 ━━━━━━━━━━━━━━━━━━━━ 67s 3s/step - accuracy: 0.8287 - loss: 1.0349 - val_accuracy: 0.8743 - val_loss: 0.8635
This demonstrates that data augmentation significantly enhances the generalization. leading to improved performance compared to training without augmentation.Without augmentation, my model accuracy was very good in the final epoch, but the fluctuations were there in validation loss and accuracy and caused overfitting, but with data augmentation I got overall better performance, so I chose this. Also, I tried with 10, 15, 18 and 20 epochs.
This shows the effectiveness of data augmentation in improving generalization, resulting in overall better performance compared to training without it. When I tried without data augmentation, my final epoch without augmentation, there was overfitting. Additionally, I experimented wth diferent epoch counts: 10, 10,15,18 and 20 and find 20 worth to proceed
epochs= range(1, len(history.history['accuracy']) + 1)
pyt.figure(figsize=(12,10))
pyt.subplot(2, 1, 1)
pyt.plot(epochs, history.history['accuracy'], label='Train Accuracy', marker='o')
pyt.plot(epochs, history.history['val_accuracy'], label='Val Accuracy', marker='o')
pyt.fill_between(epochs,
np.array(history.history['accuracy']) - 0.05,
np.array(history.history['accuracy']) + 0.05,
alpha=0.2, label='Train Accuracy Variability')
pyt.title('Model Accuracy')
pyt.xlabel('Epochs')
pyt.ylabel('Accuracy')
pyt.legend()
pyt.grid(True)
pyt.subplot(2, 1, 2)
pyt.plot(epochs, history.history['loss'], label='Train Loss', marker='o')
pyt.plot(epochs, history.history['val_loss'], label='Val Loss', marker='o')
pyt.fill_between(epochs,
np.array(history.history['loss']) - 0.05,
np.array(history.history['loss']) + 0.05,
alpha=0.2, label='Train Loss Variability')
pyt.title('Model Loss')
pyt.xlabel('Epochs')
pyt.ylabel('Loss')
pyt.legend()
pyt.grid(True)
pyt.tight_layout()
pyt.show()
Evaluation¶
Now, I will evaluate baseline model performance on test data and see the results.
test_loss, test_accuracy = model.evaluate (X_test, y_test)
6/6 ━━━━━━━━━━━━━━━━━━━━ 3s 321ms/step - accuracy: 0.8713 - loss: 0.8584
As we can see, the test data accuracy and validation accuracy are close to each other, suggesting good generalization.
With the given results, this accuracy is effective enough for the given data.For further improvement in the future wth large dataset, we may use additional tuning or transfer learning.
I tried different ways to improve my accuracy; however, even after trying regularization, early stopping, and pretrained models, Pretrained Model: First tried with The base layer of the pretrained model was frozen, and then I tried to fine-tune these layers to further improve accuracy and reduce loss, but it was not at all good, so I removed that choose to go with the standard CNN model. Seems due to a small dataset, the pretrained model also didn't work out.
Model Prediction¶
y_pred = model.predict(X_test)
y_pred_classes = nps.argmax(y_pred, axis=1)
6/6 ━━━━━━━━━━━━━━━━━━━━ 3s 362ms/step
correct_ndices =[1 for i in range(len(y_test)) if y_test[i] == y_pred_classes[i]]
f"Total Correct Predictions: {len(correct_indices)} out of {len(y_test)}"
'Total Correct Predictions: 146 out of 167'
first, will check the correct prediction numbers i am getting and misclassified samples with the help of below code
# here i am genearting classifiction report as dictonary just to change my approach so that it could be unique to avoud plagrism
report_dict = classification_report(y_test, y_pred_classes, target_names=species_encoder.classes_, output_dict=True)
report_df = pds.DataFrame(report_dict).transpose()
from IPython.display import display
display(
report_df.style.background_gradient(cmap="Blues").format("{:.2f}")
)
| precision | recall | f1-score | support | |
|---|---|---|---|---|
| Danaus_plexippus | 1.00 | 1.00 | 1.00 | 16.00 |
| Heliconius_charitonius | 0.83 | 1.00 | 0.90 | 19.00 |
| Heliconius_erato | 1.00 | 0.92 | 0.96 | 12.00 |
| Junonia_coenia | 0.67 | 0.67 | 0.67 | 18.00 |
| Lycaena_phlaeas | 1.00 | 0.78 | 0.88 | 18.00 |
| Nymphalis_antiopa | 0.90 | 0.90 | 0.90 | 20.00 |
| Papilio_cresphontes | 0.84 | 0.89 | 0.86 | 18.00 |
| Pieris_rapae | 1.00 | 0.73 | 0.84 | 11.00 |
| Vanessa_atalanta | 0.89 | 0.89 | 0.89 | 18.00 |
| Vanessa_cardui | 0.80 | 0.94 | 0.86 | 17.00 |
| accuracy | 0.87 | 0.87 | 0.87 | 0.87 |
| macro avg | 0.89 | 0.87 | 0.88 | 167.00 |
| weighted avg | 0.88 | 0.87 | 0.87 | 167.00 |
Insights:
Danaus+plexippus' and "heliconius_erato```Heliconius_eratowere classified with near-perfect precision and recall.
The model struggles with Junonia_coenia, highlighting the need for further improvements through data augmentation and class weighting.
cm = confusion_matrix(y_test, y_pred_classes)
snn.heatmap(cm, annot=True, fmt='d', cmap='Blues')
pyt.xlabel('Predicted')
pyt.ylabel('True')
pyt.title('Confusion Matrix')
pyt.show()
Prediction Test¶
import random
slected_correct_indices = random.sample(correct_indices, min(3, len(correct_indices)))
for i in selected_correct_indices:
true_label = species_encoder.inverse_transform([y_test[i]])[0]
predicted_label = species_encoder.inverse_transform([y_pred_classes[i]])[0]
pyt.imshow(X_test[i])
pyt.title(f"True: {true_label}, Predicted: {predicted_label}")
pyt.axis('off')
pyt.show()
The model misclassified 22 out of 167 test samples, resulting in a misclassification rate of 13%. maybe because of visual similarities with other or insufficient representation in teh training dataset
Model Saving¶
Once the model achieves satisfactory results, we can be saved for future use.
Conclusion¶
The model achieved promising results and highlights the importance of automated solutions in ecological research. This CNN model, augmented with regularization and data augmentation, effectively classified butterfly species and demonstrated promising performance for real-world applications.
- A unique aspect of this project was species codes to their scientific names. This approach makes the model's predictions accessible and actionable for end users, including ecologists and researchers who may not be familiar with numeric coding.
Performance¶
- Custom CNN Model: Achieved closest accuracy on test data after training with 20 epochs and data augmentation.
- Data augmentation and L2 regularization were in mitigating overfitting and improving generalization.
- Class-Level Insights: Strong performance across most species, with slight challenges in some classes due to dataset imbalance and small data.
Limitations and Recommendations¶
Limitations:
- Dataset Size: The dataset contains only 832 images, limiting diversity and generalizability.
- Class Imbalance: Uneven representation of species affects model learning for underrepresented classes.
- Pre-trained Models: Due to dataset constraints, heavy pre-trained models like ResNet50 struggled to outperform the custom CNN.i tried and failed, so I skipped.
- Confusion Among Similar Species: Some species were misclassified due to visual similarities.
Recommendations: Use a larger and more diverse dataset to improve robustness.
Business Applications
This model can automate species identification, reducing manual effort and improving accuracy for ecological studies. It supports biodiversity monitoring, citizen science projects, and conservation efforts.